Idea of disclosure

1. Describe your invention, stating the problem solved (if appropriate), and indicating the advantages of
using the invention.

The rate at which code is developed for existing platforms increases each year, and the investment in
this software can often exceed the cost of the actual system. In addition, when new hardware is created,
there is often a time lag between when the architecture becomes available and when software is migrated
to it. To support this legacy code, processors often keep a portion of the silicon dedicated to legacy
logic that maintains backward compatibility with the prior machine instructions. The machine is placed
on the market with the potential of achieving greater performance once the new software is obtained. As
such, the performance gains of follow-on processor architectures are often not realized by the end user.

This disclosure proposes a new hardware and software solution to more fully utilize the newer
architecture features of a microprocessor. The mechanism works in real time while the program is being
run by the system. The essential idea is that a fully integrated structure is included in the system
architecture, including the processor, that allows translation of legacy code to enhanced code.
Unlike prior attempts, this mechanism is not performed in-stream with the instruction fetches. It uses
another approach that avoids the penalties that occur whenever real-time code optimization is attempted.

At the system level, the optimization occurs as the program is being loaded into the main memory, the L3
cache, the L2 cache, and the L1 cache. While the program is running, the external optimizer optimizes
the legacy code one page at a time. This optimization occurs even while the original non-optimized code
is running. Once a page has been optimized, the legacy page is replaced with the optimized page. From
that point forward, the user program sees the performance boost of the optimized page.

A mechanism is built in to maintain a copy of the optimized page for future reloading of the application.
This includes a mechanism to store the optimized code and processor state to a non-volatile storage
device when the system is shut down. When the system is restarted, the user can choose to continue
from this same state so it doesn't have to reoptimize the code. This allows a program to be optimized
once and simply re-executed whenever it is used in the future.

A description of the invention is provided below. 

2. How does the invention solve the problem or achieve an advantage (a description of "the invention",
including figures inline as appropriate)?

i) Startup 

Upon machine start up, the operating system starts loading the requested programs from the permanent
memory space, usually a hard disk area. These programs are the applications selected by the user.
Assuming that no prior optimized pages exist, the program code will be non-optimized. A request made to
the target processor proceeds in this order. The processor calculates a real address and looks up its
page table to see if the page has been generated before. If not, a new page address is generated, and
the memory load from the hard disk occurs. For optimum performance, the load is satisfied as soon as
possible from the hard disk space to external main memory, to the L3 cache (optional), to the L2
cache, to the L1 cache, and to the dispatch unit. The processor continues to fetch memory locations to
run the selected applications, and valid non-optimized pages will exist for the loaded applications. In
most cases, these will be incomplete pages in the caches but complete pages in main memory. At this
stage of operation, the machine functions in the same manner as legacy machines.

ii) Instruction Optimizer 

The Optimizer consists of a "processor" and associated control logic whose task is to convert 
non-optimized instructions into a stream of instructions that execute faster on the target processor. 
Collectively, the "processor" and control logic are referred to as the Instruction Recoder.

The Instruction Recoder is separate from the rest of the system: it may operate at a separate clock
speed from the target processor, and it is not affected by occurrences within the main system (e.g.,
interrupts, exceptions). It simply reads instructions from a page of main memory, reschedules them for
higher performance, and stores them in another page of main memory. The Instruction Recoder can be
implemented in two ways: 1) a microcontroller that executes a recoding algorithm stored in ROM, or 2)
hard-wired logic.

Since both the Instruction Recoder and target processor have access to main memory, this invention
requires that main memory be multiported (two read ports and two write ports). This allows both
processes to operate at the same time without causing structural hazards. In other words, the Instruction
Recoder can read one non-optimized instruction and write one optimized instruction per cycle, and the
target processor can read/write one item from this same memory element. In another embodiment, the
Instruction Recoder issues a priority interrupt request to the memory subsystem whenever it needs to
access main memory, and it is granted access when the target processor is not using the memory bus.

The Instruction Recoder must be able to detect self-modifying code and react accordingly to it. This can
be implemented by a direct communication from the target processor to the Instruction Recoder, or it can
be detected by the Instruction Recoder when the L2/L3 caches write back to main memory. In any case,
the detection of modified code causes the Instruction Recoder to 1) check its optimized code to ensure it
is still valid and make changes if needed, or 2) disable optimization for the entire section of
self-modifying code. In essence, the Instruction Recoder must account for any code changes after the
non-optimized instructions are loaded into main memory.
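
The two responses to a detected code modification can be illustrated with a small sketch. This is not
the disclosed hardware: the RecoderState class, its fields, and the on_code_write hook are hypothetical
names chosen for illustration, and "check and make changes" is simplified here to discarding the
optimized copy and queuing the page for another pass.

```python
from dataclasses import dataclass, field


@dataclass
class RecoderState:
    """Hypothetical bookkeeping kept by the Instruction Recoder."""
    optimized_pages: dict = field(default_factory=dict)   # legacy page -> optimized page
    disabled_pages: set = field(default_factory=set)      # pages excluded from optimization
    reoptimize_queue: list = field(default_factory=list)  # pages to be optimized again

    def on_code_write(self, legacy_page: int, disable: bool = False) -> str:
        """React to a write into a legacy code page (self-modifying code)."""
        if disable:
            # Response 2: stop optimizing this page and drop any optimized copy.
            self.optimized_pages.pop(legacy_page, None)
            self.disabled_pages.add(legacy_page)
            return "disabled"
        if legacy_page in self.optimized_pages:
            # Response 1 (simplified): the optimized copy may be stale, so
            # discard it and schedule the page for another optimization pass.
            del self.optimized_pages[legacy_page]
            self.reoptimize_queue.append(legacy_page)
            return "recheck"
        # No optimized copy exists yet, so nothing can be stale.
        return "ignored"
```

Whether the notification arrives directly from the target processor or is inferred from a cache
write-back, the same hook could be called; only the source of the event differs.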

The actual work of optimizing code is straightforward. The Instruction Recoder knows the architecture of 
the target processor, including the following things: 

Number of execution units, types of units (e.g., 4 integer, 2 floating point, 2 multimedia) 
Latency of each operation (e.g., 1 cycle for add, 3 cycles for multiply) 
Number of architected registers, number of ports on register files and caches 

Therefore it knows how much parallelism is in the target processor, how fast operations execute, and the
limits for getting data to/from the processor. Knowing this information, the Instruction Recoder analyzes
the non-optimized instruction stream to find a "more efficient" way to schedule the instructions. "More
efficient" means that the new code utilizes more resources in parallel (prevents execution units from
going idle) and has fewer pipeline stalls. The Instruction Recoder uses common techniques used by
compilers (such as Loop Unrolling) to improve the scheduling of instructions. If the target processor has
a superior architecture to the one assumed for the non-optimized code, then the Instruction Recoder
should be able to improve the instruction scheduling such that the Optimized code executes faster than
the Non-optimized code. In addition, because the act of optimizing code is not in the I-fetch path for the
target processor, there is no penalty for creating the optimized code. In fact, this is transparent to the
user until the target processor starts executing the Optimized code, and then they experience the
performance improvement.
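
As an illustration of how knowledge of unit counts and latencies can drive rescheduling, the sketch below
uses greedy list scheduling, a common compiler technique. The disclosure does not specify the recoding
algorithm, so this is an assumption; the Instr fields and the list_schedule function are hypothetical
names. The sketch also assumes that each register is written at most once and before any use, so only
read-after-write dependences matter.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    name: str          # label used in the returned schedule
    unit: str          # kind of execution unit required, e.g. "int" or "fp"
    latency: int       # cycles until the result is available (>= 1)
    dest: str          # register written
    srcs: tuple = ()   # registers read


def list_schedule(instrs, units):
    """Greedy list scheduling; returns {instruction name: issue cycle}.

    Assumption: every register is written at most once and before any use,
    so only read-after-write dependences need to be respected.
    """
    for ins in instrs:
        if units.get(ins.unit, 0) < 1 or ins.latency < 1:
            raise ValueError(f"cannot schedule {ins.name}")
    produced = {ins.dest for ins in instrs}
    ready_at = {}      # register -> first cycle in which its value can be read
    issue = {}
    pending = list(instrs)
    cycle = 0

    def readable(reg):
        if reg in ready_at:
            return ready_at[reg] <= cycle
        # Registers not written in this block are live-in and always readable.
        return reg not in produced

    while pending:
        free = dict(units)               # units not yet claimed in this cycle
        for ins in list(pending):        # oldest instruction gets priority
            if free[ins.unit] > 0 and all(readable(s) for s in ins.srcs):
                free[ins.unit] -= 1
                issue[ins.name] = cycle
                ready_at[ins.dest] = cycle + ins.latency
                pending.remove(ins)
        cycle += 1
    return issue


# Two independent multiply/add chains, using the example latencies
# from the text (1 cycle for add, 3 cycles for multiply).
program = [
    Instr("mul1", "int", 3, "r1", ("x", "y")),
    Instr("add1", "int", 1, "r2", ("r1",)),
    Instr("mul2", "int", 3, "r3", ("z", "w")),
    Instr("add2", "int", 1, "r4", ("r3",)),
]
one_unit = list_schedule(program, {"int": 1})
two_units = list_schedule(program, {"int": 2})
```

With one integer unit the second multiply is hoisted into the first multiply's latency shadow (issue
cycles 0, 1, 3, 4 for mul1, mul2, add1, add2); with two integer units both chains run side by side
(cycles 0, 0, 3, 3).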

iii) Mechanism to write optimized code to lower level caches (L1 and L2) 

Once the optimized page has been fully updated in the main memory (in other words, a duplicate page has
been created, but has been optimized for the new [illegible]), the system caches must be updated. Since
these are pages which are newly assigned, there is zero risk to running applications in preloading these
pages into the L3, the L2, and the L1. The system logic has full access to the L3, and often the L2
caches, so the preloading of these caches could follow existing protocol. Preloading the L1 cache would
not be required, but may be desired to optimize performance. The mechanism to preload the L1 would be to
use standard DMA cycle steal functions, where the system memory management would monitor the state of
the processor function. Whenever bandwidth was available on the bus, the system memory management system
would DMAC the updated page entries into the L1 cache. Again, since these are new pages, doing so would
not risk corrupting the running applications.

iv) Mechanism to update page table in Main Memory and TLB in CPU

Once a sufficient portion of the new page has been updated in the cache structure, the old legacy page
is ready to be swapped out for the new optimized page. Within the processor is a Page Table Buffer (the
TLB) that stores a number n of active pages. These pages are stored in the TLB to reduce the number of
times that the page translation address generation would occur. The larger the TLB, the more efficient
the memory transactions will be. This invention adds a DMAC path into the TLB. The memory system already
knows the page entry that has now been optimized. A TLB DMAC cycle occurs in which the existing TLB page
entry is invalidated. In the simplest case this would be sufficient to pick up the new page, since the
processor would now have to go to the operating system and recalculate the page address, and in this
case, the assigned page address could be the new optimized page entry. In the preferred embodiment, the
system would load the new optimized page entry into the processor TLB, then validate the new page and
invalidate the legacy page. Now, when the page lookup occurs, and the index address is in a safe match,
or adjusted (described below), the new optimized page will be valid and the processor will continue at
this point.

A circuit will exist on the processor to allow up to n addresses to be loaded for address
translation/jump to a new location within the optimized page. This buffer will be preloaded by the
system memory control with the DMAC protocol or via an I/O serial port as [illegible] registers. The
Safe transition comparator will compare the current legacy page index with the translated indexes and
switch if a stored match occurs. If a match occurs, then the page swap will be allowed as described, and
the new translated index will be substituted for the index. In this manner the code will be functionally
re-aligned and operation can continue from the new optimized page. Since the new optimized page is
preloaded, the user will not see a cache miss penalty, but from this point on would have improved
performance for this page of the applications.
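
The behaviour of the safe transition comparator can be modelled in a few lines. This is a software
sketch of the idea only, assuming the buffer is a simple table from legacy page index to translated
index; the class and method names are hypothetical, and the index values in the usage lines are made up.

```python
class SafeTransitionComparator:
    """Hypothetical model of the on-processor buffer of up to n index translations."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.table = {}   # legacy page index -> index within the optimized page

    def preload(self, legacy_index, optimized_index):
        """Model of the system memory control loading one translation entry."""
        if legacy_index not in self.table and len(self.table) >= self.capacity:
            raise ValueError("translation buffer full")
        self.table[legacy_index] = optimized_index

    def try_swap(self, current_legacy_index):
        """Return the translated index if this is a safe point, else None.

        None means the processor keeps running the legacy page for now.
        """
        return self.table.get(current_legacy_index)


comparator = SafeTransitionComparator(capacity=2)
comparator.preload(0x40, 0x28)               # e.g. a loop entry that moved after rescheduling
safe_target = comparator.try_swap(0x40)      # stored match: swap pages, continue at 0x28
unsafe_target = comparator.try_swap(0x44)    # no match: stay on the legacy page
```

Restricting the swap to preloaded indexes is what keeps the transition safe in this model: execution
only moves to the optimized page at points where an equivalent location is known.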




v) Optional mechanism to invalidate non-optimized code in L1/L2 caches to free up space 

The legacy page will now be stale data in the L1, L2, and L3 caches. Two methods could exist to free up
this space. One method would be to allow the existing LRU to remove the entries. Since no further
requests will occur for these pages (the new optimized pages are now the valid translations), the data
will become LRU naturally and be removed. Alternatively, once a legacy page has been swapped with the
optimized page, the cache control logic could proactively walk the cache tags and check the page index
stored in each tag. When it recognizes an entry belonging to the replaced page (held in a register that
records the last replaced page), it would update the LRU value for that line entry to mark the legacy
page as LRU. In this manner, if a legacy page had been recently used and some other valid data still
exists for that line, the legacy line entry will be marked for replacement and not the still valid
entry. This optional portion of the invention could improve L1, and even L2, cache hit rates.
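
The proactive alternative can be sketched for a single cache set. The Line structure, the age encoding
(0 = most recently used), and the function names are assumptions made for illustration; real cache
controllers encode LRU state differently.

```python
from dataclasses import dataclass


@dataclass
class Line:
    page: int   # page index stored in the tag
    age: int    # LRU rank within the set: 0 = most recently used


def demote_replaced_page(cache_set, replaced_page):
    """Make every line of the replaced legacy page the first eviction candidate."""
    # Lines of other pages keep their relative order and move toward the
    # recently-used end; lines of the replaced page move to the LRU end.
    ordered = sorted(cache_set, key=lambda ln: (ln.page == replaced_page, ln.age))
    for rank, line in enumerate(ordered):
        line.age = rank


def victim(cache_set):
    """Line that the replacement policy would evict next."""
    return max(cache_set, key=lambda ln: ln.age)


# A 4-way set in which the legacy page (7) happens to be the most recently used.
ways = [Line(7, 0), Line(3, 1), Line(5, 2), Line(9, 3)]
victim_before = victim(ways).page
demote_replaced_page(ways, replaced_page=7)
victim_after = victim(ways).page
```

Before the update the still-valid line of page 9 would be evicted next; afterwards the stale line of
page 7 is the victim, which is the effect described above.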



vi) Method to write CPU state and optimized code to virtual disk at shutdown 

Upon shut down, additional hardware resources can be allocated to preserve the current optimized state
of the user application codes. If the system is powered down without this restoration means, then the
next time the system is powered up, the user will start with the non-optimized legacy code, and the
machine learning/optimizing of the application will have to re-occur. An additional claim of this
invention is the structure to maintain the optimized state of the processor. The valid optimized page
entries will be stored in a new optimized page address memory. Upon shutdown, the system will read the
Page Entry Memory (PEM) and write these locations of main memory into a Recovery Optimized Page Disk
(ROPD). In addition, the PEM contents will be saved onto the ROPD.



Upon restart, the system will first recover the contents of the ROPD into the main memory, but not into
the caches. Now, as the user runs previously optimized code, the code will still exist in main memory,
and the user will always see the optimized results, without the optimizer having to repeat the whole
page optimization routine.
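
The save and restore sequence can be illustrated in software. This sketch assumes the PEM is a set of
page addresses and main memory is a mapping from page address to page contents, and it uses a JSON file
to stand in for the ROPD; the function names and the file format are hypothetical.

```python
import json


def save_ropd(path, pem, main_memory):
    """At shutdown: write the PEM and the pages it lists to the recovery file."""
    image = {
        "pem": sorted(pem),
        # Only pages listed in the PEM are saved; JSON keys must be strings.
        "pages": {str(page): main_memory[page].hex() for page in pem},
    }
    with open(path, "w") as f:
        json.dump(image, f)


def restore_ropd(path):
    """At restart: rebuild (pem, main_memory) from the recovery file.

    Only main memory is repopulated; the caches are left empty.
    """
    with open(path) as f:
        image = json.load(f)
    pem = set(image["pem"])
    main_memory = {int(page): bytes.fromhex(data)
                   for page, data in image["pages"].items()}
    return pem, main_memory
```

A caller would invoke save_ropd during the shutdown sequence and restore_ropd before any application is
loaded, so that lookups for previously optimized pages are satisfied from main memory.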



vii) Method to control the speed of the optimizer (# of instructions optimized per second) 
How much effort is spent optimizing each instruction? That depends on whether the optimizer is
sufficiently ahead of the processor. If it is, the optimizer spends more time optimizing the code.
Otherwise, the optimizer reduces its effort. This is useful at startup, when the processor and optimizer
are working on the same instructions.
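
One possible form of this control is sketched below. The disclosure does not say how the lead is
measured or how many effort levels exist, so measuring the lead in pages, the three effort levels, and
the choose_effort name are all assumptions.

```python
def choose_effort(optimizer_page, processor_page, max_effort=3, lead_per_level=2):
    """Pick an optimization effort level from how far ahead the optimizer is.

    Level 0 is the cheapest pass; higher levels spend more time per instruction.
    """
    lead = optimizer_page - processor_page
    if lead <= 0:
        # Optimizer and processor are on the same page (e.g. at startup),
        # or the optimizer has fallen behind: do the minimum.
        return 0
    # Each additional `lead_per_level` pages of lead unlocks one more level.
    return min(max_effort, lead // lead_per_level + 1)
```

With these defaults, a lead of one page selects level 1, a lead of two pages selects level 2, and any
lead of four pages or more selects the maximum level.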

One option the user may have when loading an application is to be asked whether to let the optimizer
optimize each page before allowing any execution of that page. While this is not the core of the
invention, it could be allowed for applications that are not timing critical.

Due to the unique nature of this invention, code can begin executing immediately on the processor
without the enhanced features of the processor, and then, in real time, the code is shifted over to the
enhanced architecture mode without disrupting the service of the machine. The user can choose to save
the upgrade so that future system use will not incur the initial performance penalty.